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Abstract 

hJ 

Integration is affected by the curse of dimensionality and quickly becomes intractable as 
the dimensionality of the problem grows. We propose a randomized algorithm that, with high 

l — 1 probability, gives a constant-factor approximation of a general discrete integral defined over an 

exponentially large set. This algorithm relies on solving only a small number of instances of a 
discrete combinatorial optimization problem subject to randomly generated parity constraints 
used as a hash function. As an application, we demonstrate that with a small number of MAP 

f" — queries we can efficiently approximate the partition function of discrete graphical models, which 

can in turn be used, for instance, for marginal computation or model selection. 

O 1 Introduction 

m 

Computing integrals in very high dimensional spaces is a fundamental and largely unsolved problem 
of scientific computation [H [23] , with numerous applications ranging from machine learning and 
^ statistics to biology and physics. As the volume grows exponentially in the dimensionality, the 

problem quickly becomes computationally intractable, a phenomenon traditionally known as the 
curse of dimensionality [2]. 

We revisit the problem of approximately computing discrete integrals, namely weighted sums 
over (extremely large) sets of items. This problem encompasses several important probabilistic 
inference tasks, such as computing marginals or normalization constants (partition function) in 
graphical models, which are in turn the cornerstones for parameter and structure learning |32j. 
Although we focus on the discrete case, the continuous case can in principle also be addressed, as it 
can be approximated by numerical integration. There are two common approaches to approximate 
these large discrete sums: sampling and variational methods. Variational methods [17} I32j. often 
inspired by statistical physics, are very fast but do not provide guarantees on the quality of the 
results. Since sampling and counting can be reduced to each other [16], approximate techniques 
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based on sampling are quite popular, but they suffer from similar issues because the number 
of samples required to obtain a statistically reliable estimate often grows exponentially in the 
problem size. Among sampling techniques, Markov Chain Monte Carlo (MCMC) methods are 
asymptotically accurate, but guarantees for practical applications exist only in a limited number 
of cases (fast mixing chains) I18j . They are therefore often used in an heuristic manner. In 
practice, their performance crucially depends on the choice of the proposal distributions, which 
often must be domain-specific and expert-designed [91 [21] . 

We introduce a randomized scheme that computes with high probability (1 — 5 for any desired 
5 > 0) an approximately correct estimate (within a factor of 1 + e for any desired e > 0) for 
general weighted sums defined over exponentially large sets of items, such as the set of all possible 
variable assignments in a discrete probabilistic graphical model. From a computational complexity 
perspective, the counting problem we consider is complete for the #P complexity class [28], a set 
of problems encapsulating the entire Polynomial Hierarchy and believed to be significantly harder 
than NP. 

The key idea is to reduce this #P problem to a small number (polynomial in the dimensionality) 
of instances of a (NP-hard) combinatorial optimization problem defined on the same space and 
subject to randomly generated "parity" constraints. The rationale behind this approach is that 
although combinatorial optimization is intractable in the worst case, it has witnessed great success in 
the past 50 years in fields such as Mixed Integer Programming (MIP) and propositional Satisfiability 
Testing (SAT). Problems such as computing a Maximum a Posteriori (MAP) assignment, although 
NP-hard, can in practice often be approximated [25J or solved exactly fairly efficiently |22| 123]. In 
fact, modern solvers can exploit structure in real- world problems and prune large portions of the 
search space, often dramatically reducing the runtime. In contrast, in a #P counting problem such 
as computing a marginal probability, one needs to consider contributions of an exponentially large 
number of items. 

Our algorithm, called Weighted-Integrals- And-Sums-By-Hashing (WISH), relies on randomized 
hashing techniques to "evenly cut" a high dimensional space. Such hashing was introduced by 
Valiant and Vazirani [29J to study the relationship between the number of solutions and the hardness 
of a combinatorial search. These techniques were also applied by Gomes et al. IT2] to obtain 
bounds on the number of solutions for the SAT problem. Our work is more general in that it 
can handle general weighted sums, such as the ones arising in probabilistic inference for graphical 
models. Our work is also closely related to recent work by Hazan and Jaakkola [14] , who obtain 
a lower bound on the partition function by taking suitable expectations of a combination of MAP 
queries over randomly perturbed models. We improve upon this in two crucial aspects, namely, our 
estimate is a constant factor approximation of the true partition function (while their bounds have 
no tightness guarantee), and we provide a concentration result showing that our bounds hold not 
just in expectation but with high probability with a polynomial number of MAP queries. Note that 
this is consistent with known complexity results regarding #P and BPP NP ; see Remark 1 below. 

We demonstrate the practical efficacy of the WISH algorithm in the context of computing the 
partition function of random Clique-structured Ising models, Grid Ising models with known ground 
truth, and a challenging combinatorial application (Sudoku puzzle) completely out of reach of 
techniques such as Mean Field and Belief Propagation. We also consider the Model Selection 
problem in graphical models, specifically in the context of hand-written digit recognition. We show 
that our "anytime" and highly parallelizable algorithm can handle these problems at a level of 
accuracy and scale well beyond the current state of the art. 
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2 Problem Statement and Assumptions 



Let E be a (large) set of items. Let w : E — )■ M + be a non- negative function that assigns a weight 
to each element of E. We wish to (approximately) compute the total weight of the set, defined as 
the following discrete integral or "partition function" 

o-es 

We assume uu is given as input and that it can be compactly represented, for instance in a factored 
form as the product of conditional probabilities tables. Note however that our results are more 
general and do not rely on a factored representation. 

Assumption: We assume to have access to an optimization oracle that can solve the following 
constrained optimization problem 

maxw(a)l {c} (a) (2) 

where l/<n : E — > {0, 1} is an indicator function for a compactly represented subset C C E, i.e., 
l{C}( cr ) = 1 iff c G C For concreteness, we discuss our setup and assumptions in the context 
probabilistic graphical models, which is our motivating application. 

2.1 Inference in Graphical Models 

We consider a graphical model specified as a factor graph with N = \ V\ discrete random variables 
Xi,i G V where X{ G Xi- The global random vector x = {x s ,s G V} takes value in the cartesian 
product X = X\ x Xi x • ■ • x X^. We consider a probability distribution over x G X (called 
configurations) p(x) = 4 IIqgx ^({^jo) that factors into potentials or factors ip a : {x} a h-» IR + , 
where X is an index set and {x} a C V a subset of variables the factor ?/> Q depends on, and Z is a 
normalization constant known as the partition function. 

Given a graphical model, we let E = X be the set of all possible configurations (variable 
assignments). Define a weight function w : X — > M + that assigns to each configuration a score 
proportional to its probability: w(x) = Y\ ae x ipa({%}a)- Z may then be rewritten as 

Z=^2w(x) = ^2HM{xU) (3) 

Computing Z is typically intractable because it involves a sum over an exponential number of con- 
figurations, and is often the most challenging inference task for many families of graphical models. 
Computing Z is however needed for many inference and learning tasks, such as evaluating the like- 
lihood of data for a given model, computing marginal probabilities, and parameter estimation |32j. 

In the context of graphical models inference, we assume to have access to an optimization oracle 
that can answer Maximum a Posteriori (MAP) queries, namely, solve the following constrained 
optimization problem 

argmaxp(x | C) 

x&X 

that is, we can find the most likely state (and its weight) given some evidence C. This is a strong 
assumption because MAP inference is known to be an NP-hard problem in general. Notice however 
that computing Z is a #P-complete problem, a complexity class believed to be even harder than 
NP. 
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2.2 Quadratures of Integrals 



Suppose we are given a quadrature for a continuous (multidimensional) integral of a function 
/ : W 1 — > M + over a high dimensional set S C M. n 



where X is some discretization of S (e.g., grid based), and w{x) approximates the integral of /(x) 
over the corresponding element of volume. In this case, we require a compact representation for w 
and access to an oracle able to optimize the discretized function, subject to arbitrary constraints. 
See, e.g., Figure [T] 

For simplicity, in the following we will restrict ourselves to the binary case, i.e., E = X = {0, l} n . 
The general multinomial case where the sum is over X\ x X2 x • • • x Xn can be transformed into the 
former case using a binary representation, requiring |~log 2 \Xi\\ bits (binary variables) per dimension 
i. 

3 Preliminaries 

We review some results on the construction and properties of universal hash functions; cf. |10| 127] . 
A reader already familiar with these results may skip to the next section. 

Definition 1. A family of functions % = {h : {0, l} n — > {0, l} m } is pairwise independent if the 
following two conditions hold when H ^— R 7~L is a function chosen uniformly at random from 7-L. 1) 
Vx G {0,l} n , the random variable H(x) is uniformly distributed in {0, l} m . 2) \/x±,X2 G {0, l} n 
x\ 7^ X2, the random variables H{x\) and H{x2) are independent. 

A simple way to construct such a function is to think about the family T~L of all possible func- 
tions {0, l} n — > {0, \} m . This is a family of not only pairwise independent but fully independent 
functions. However, each function requires m2 n bits to be represented, and is thus impractical in 
the typical case where n is large. On the other hand, pairwise independent hash functions can be 
constructed and represented in a much more compact way as follows; see Appendix for a proof. 



Proposition 1. Let A G {0, l} mxn ; b G {0,l} m . The family U = {h A ,b(x) : {0, l} n {0, l}" 1 } 



where hA,b{x) = Ax + b mod 2 is a family of pairwise independent hash functions. 

The space C = {x : h^{x) = p} has a nice geometric interpretation as the translated nullspace 
of the random matrix A. It is therefore a finite dimensional vector space, with operations defined 
on the field F(2) (arithmetic modulo 2). We will refer to constraints in the form Ax = b mod 2 
as parity constraints, as they can be rewritten in terms of XORs operations as Anx\ © A^X2 © 

© Ai n x n — bi. 



We start with the intuition behind our algorithm to approximate the value of W called Weighted- 
Integrals- And-Sums-By-Hashing (WISH). 

Computing W as defined in Equation ([!]) is challenging because the sum is defined over an 
exponentially large number of items, i.e., |S| = 2 n when there are n binary variables. Let us define 




4 The WISH Algorithm 
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Figure 1: Visualization of the "thinning" effect of random parity constraints, after adding 0, 1, 
2, and 3 parity constraints. Leftmost plot shows the original function to integrate. Constrained 
optimal solution in red. 
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Figure 2: Horizontal vs. vertical slices for integration. 



the tail distribution of weights as G(u) = \{o~ | w(cr) > u}\. Note that G is a non-increasing step 
function, changing values at no more than 2 n points. Then W may be rewritten as J R+ G(u)du, 
i.e., the total area A under the G(u) vs. u curve. One way to approximate W is to (implicitly) 
divide this area A into either horizontal or vertical slices (see Figure [4]) , approximate the area in 
each slice, and sum up. 

Suppose we had an efficient procedure to estimate G{u) given any u. Then it is not hard to 
see that one could create enough slices by dividing up the x-axis, estimate G(u) at these points, 
and estimate the area A using quadrature. However, the natural way of doing this to any degree 
of accuracy would require a number of slices that grows at least logarithmically with the weight 
range on the x-axis, which is undesirable. 

Alternatively, one could split the y-axis, i.e., the G(u) value range [0,2 n ], at geometrically 
growing values 1, 2, 4, • • • , 2 n , i.e., into bins of sizes 1, 1, 2, 4, • • • , 2 n_1 . Let &o > bi > • • • > b n be 
the weights of the configurations at the split points. In other words, bi is the 2*-th quantile of the 
weight distribution. Unfortunately, despite the monotonicity of G(u), the area in the horizontal 
slice defined by each bin is difficult to bound, as bi and bi+\ could be arbitrarily far from each other. 
However, the area in the vertical slice defined by bi and foj+i must be bounded between 2 l {bi — 
and 2 l+1 (6j — 6j+i), i.e., within a factor of 2. Thus, summing over the lower bound for all such slices 
and the left-most slice, the total area A must be within a factor of 2 of X^o 1 ^*(^j — h+i) + 2 n 6 n = 
&o + X^iLi 2* £>i. Of course, we don't know bi. But if we could approximate each bi within a factor 
of p, we would get a 2p-approximation to the area A, i.e., to W. 

WISH provides an efficient way to realize this strategy, using a combination of randomized hash 
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Algorithm 1 WISH (w : S -»• M + ,n = log 2 |E|,<5,a) 



r 



ln ^^^ Inn 

for i = 0, • • • , n do 
for i = 1,- • • ,T do 

Sample hash function /i^ 6 : E — > {0, l} 4 , i.e. sample uniformly A £ {0, l} iXn , 6 G {0, 1} J 
w\ <— maxo- u;(cr) subject to Aa = 6 mod 2 
end for 

Mj «- Median^ 1 , • • • , wf) 
end for 

Return M + £Xo ^+i2 4 



functions and an optimization oracle to approximate the bi values with high probability. Note that 
this method allows us to compute the partition function W (or the area A) by estimating weights 
hi at n + 1 carefully chosen points, which is "only" an optimization problem. 

The key insight to compute the bi values is as follows. Suppose we apply to configurations in E 
a randomly sampled pairwise independent hash function with 2 m buckets and use an optimization 
oracle to compute the weight w m of a heaviest configuration in a fixed (arbitrary) bucket. If we 
repeat this process T times and consistently find that w m > w* , then we can infer by the properties 
of hashing that at least 2 m configurations (globally) are likely to have weight at least w* . By the 
same token, if there were in fact at least 2 m+c configurations of a heavier weight w > w* for some 
c > 0, there is a good chance that the optimization oracle will find w m > w and we would not 
underestimate the weight of the 2 m -th heaviest configuration. As we will see shortly, this process, 
using pairwise independent hash functions to keep variance low, allows us to estimate bi accurately 
with only T = O(lnn) samples. 

The pseudocode of WISH is shown as Algorithm [l| It is parameterized by the weight function 
w, the dimensionality n, a correctness parameter 8 > 0, and a constant a > 0. Notice that the 
algorithm requires solving only 0(n Inn In 1/(5) optimization instances (MAP inference) to compute 
a sum defined over 2 n items. In the following section, we formally prove that the output is a 
constant factor approximation of W with probability at least 1 — 5 (probability over the choice 
of hash functions). Figure [T] shows the working of the algorithm. As more and more random 
parity constraints are added in the outer loop of the algorithm ("levels" increasing from 1 to n), 
the configuration space is (pairwise-uniformly) thinned out and the optimization oracle selects the 
heaviest (in red) of the surviving configurations. The final output is a weighted sum over the 
median of T such modes obtained at each level. 

Remark 1. The parity constraints Aa = b mod 2 do not change the worst-case complexity of 
an NP-hard optimization problem. Our result is thus consistent with the fact that #P can be 
approximated in BPP NP , that is, one can approximately count the number of solutions with a 
randomized algorithm and a polynomial number of queries to an NP oracle |1Q) . 

Remark 2. Although the parity constraints we impose are simple linear equations over a field, 
they can make the optimization harder. For instance, finding a configuration with the smallest 
Hamming weight satisfying a set of parity constraints is known to be NP-hard, i.e. equivalent to 
computing the minimum distance of a parity code j3j [30] - On the other hand, most low density 
parity check codes can be solved extremely fast in practice using heuristic methods such as message 



6 



passing. 



Remark 3. Each of the optimization instances can be solved independently, allowing natural 
massive parallelization. We will also discuss how the algorithm can be used in an anytime 
fashion, and the implications of obtaining suboptimal solutions. 



5 Analysis 

Since many configurations can have identical weight, it will help for the purposes of the analysis to 
fix, w.l.o.g., a weight-based ordering of the configurations, and a natural partition of the |S| = 2 n 
configurations into n + 1 bins that the ordering induces. 

Definition 2. Fix an ordering a, 1 < % < 2 n , of the configurations in £ such that for 1 < j < 2 n , 
w((Tj) > w(aj+i). For i G {0, 1, • • • , n}, define bi = w(a 2 i)- Define a special bin B = {<Ji} and, for 
i G {0, 1, • • ■ , n — 1}, define bin Bi = {a 2 i + i, 0"2 i +2> " " ; cr 2 i + 1 }- 

Note that bin Bi has precisely 2 % configurations. Further, for all a £ Bi, it follows from the 
definition of the ordering that w(a) E \bi+i, bi]. This allows us to bound the sum of the weights of 
configurations in Bi (the "horizontal" slices) between 2 % bi+\ and 2 l bi. 



5.1 Estimating the Total Weight 

Our main theorem is that Algorithm [T] provides a constant factor approximation to the partition 
function. 

Theorem 1. For any 5 > and positive constant a < 0.0042, Algorithm^ makes 0(nlnnln 1/S) 
MAP queries and, with probability at least (1 — 5), outputs a 16- approximation ofW = X^o-es w (°~)- 

The proof relies on two intermediate results whose proofs may be found in the Appendix. 

Lemma 1. Let Mi = Median(u^, • • • ,wf) be defined as in Algorithm^ and bi as in Definition^ 
Then, for all c > 2, there exists an a*(c) > such that for < a < a*(c), 

Pr [Mi G [&min{i+c,n},&ma x {i- C ,0}]] > 1 ~ exp(-aT) 

Lemma 2. Let L' = b + E^ 1 Vin{i+c+i,n}2 i and U' = b + ^=0 K^{i+i-cfl}^ ■ Then U' < 
2 2c L'. 

Proof of Theorem^ It is clear from the pseudocode of Algorithm [l] that it makes 0(nlnnlnl/5) 
MAP queries. For accuracy analysis, we can write W as: 

2 n n-1 

j=l i=0 aeBi 

n—1 n—1 

6o + 5Z^+i2 i ,6o + ^^2 i 4[L,t/] 

i=0 i=0 

Note that U < 2L because 2L = 2b + E7=o b i+ i2 i+1 = 2b + ELi h & 1 = ^0 + E"=o b ^ ^ U - 
Hence, if we had access to the true values of all bi, we could obtain a 2-approximation to W . 
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We do not know true bi values, but Lemma [T] shows that the Mi values computed by Algorithm[T] 
are sufficiently close to b% with high probability. Recall that Mj is the median of MAP values 
computed by adding i random parity constraints and repeating the process T times. Specifically, 
for c > 2, it follows from Lemma [T] that for < a < a*(c), 



Pr 



P| (Mi G [fr m in{i+c,?i}> frmax{i-c,0}]) 

> 1 - nexp(-aT) > (1 - 5) 



i=0 



for T 



log(Vj) 
a 



logn, and Mq = 6q- Thus, with probability at least (1 — 5) the output of Algorithm 



1[ M + JXo' M i+ i2 J , lies in the range: 



n-1 



n-1 



^min{j+c+l,n}-^ > "max{i+l— c,0} " 



i=0 



i=0 



Let us denote this range [L' , U']. By monotonicity of bi, V < L < U < U' . Hence, W € [L', U']. 

Applying Lemma |2| we have U' < 2 2c L', which implies that with probability at least 1 — 5 the 
output of Algorithm 111 is a 2 2c approximation of W. For c = 2, observing that a* (2) > 0.0042 (see 
proof of Lemma [TJ, we obtain a 16- approximation for < a < 0.0042. □ 

5.2 Estimating the Tail Distribution 

We can also estimate the entire tail distribution of the weights, defined as G(u) = |{<r | w(a) > u}\. 

Theorem 2. Let Mi be defined as in Algorithm^ u E M + , and q(u) be the maximum i such that 
Vj G {0, • • • , i}, Mj > u. Then, for any 5 > 0, with probability > (1 — 5), 2 q ^ is an 8- approximation 
of G(u) computed using 0(ralnnln 1/5) MAP queries. 

While this is an interesting result in its own right, if the goal is to estimate the total weight 
W, then the scheme in Section 5.1, requiring a total of only G(nlnnln 1/5) MAP queries, is more 
efficient than first estimating the tail distribution for several values of u. 



5.3 Improving the Approximation Factor 

Given a ^-approximation algorithm such as Algorithm [l] and any e > 0, we can design a (1 + e)- 
approximation algorithm with the following construction. Let t = log 1+(E K. Define a new set of 
configurations = S x S x • • • x S, and a new weight function w' : £ — > M as w'(a\, • • • , o~t) = 
w(a\)w(a2) ■ ■ ■ w(ai). 

Proposition 2. Let W be a k- approximation o/^o-'es* w '(°~')- Then W 1 ^ is a k 1 ^ -approximation 

To see why this holds, observe that W = E CT 'es« w '( a ') = (Eaes w ( cr )Y = w& - Since \ w ' < 
W < kW' , we obtain that W 1 '* must be a n 1 ^ = 1 + e approximation of W. 

Note that this construction requires running Algorithm [T] on an enlarged problem with £ times 
more variables. Although the number of optimization queries grows polynomially with £, increasing 
the number of variables might significantly increase the runtime. 
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5.4 Further Approximations 

When the instances defined in the inner loop are not solved to optimality, Algorithm [T] still provides 
approximate lower bounds on W with high probability. 

Theorem 3. Let w\ be suboptimal solutions for the optimization problems in Algorithm^ i.e., 
w\ < w\- Let W be the output of Algorithm^ with these suboptimal solutions. Then, for any 5 > 0, 
with probability at least 1 — 5, ^ < W . 

Further, if w\ > \w\ for some L > 0, then with probability at least 1 — 5, W is a 16L- 
approximation to W . 

The output is always an approximate lower bound, even if the optimization is stopped early. 
The lower bound is monotonically non-decreasing over time, and is guaranteed to eventually reach 
within a constant factor of W . We thus have an anytime algorithm. 



6 Experimental Evaluation 

We implemented WISH using the open source solver ToulBar2 pQ to solve the MAP inference prob- 
lem. ToulBar2 is a complete solver (i.e., given enough time, it will find an optimal solution and 
provide an optimality certificate), and it was one of the winning algorithms in the UAI-2010 infer- 
ence competition. We augmented ToulBar2 with the IBM ILOG CPLEX CP Optimizer 12.3 based 
techniques borrowed from Gomes et al. [13J to efficiently handle the random parity constraints. 
Specifically, the set of equations Ax = b mod 2 are linear equations over the field F(2) and thus 
allow for efficient propagation and domain filtering using Gaussian Elimination. 

For our experiments, we run WISH in parallel using a compute cluster with 642 cores. We assign 
each optimization instance in the inner loop to one core, and finally process the results when all 
optimization instances have been solved or have reached a timeout. 

For comparison, we consider Tree Reweighted Belief Propagation [31J which provides an upper 
bound on Z, Mean Field [32] which provides a lower bound, and Loopy Belief Propagation |20j 
which provides an estimate with no guarantees. We use the implementations of these algorithms 
available in the LibDAI library |19j . 



6.1 Provably Accurate Approximations 

For our first experiment, we consider the problem of computing the partition function, Z (cf. Eqn. ([3 
of random Clique-structured Ising models on n binary variables Xi £ {0, 1} for i £ {1, • • • , n}. The 
interaction between xi and Xj is defined as tpij(xi,Xj) = exp(-Wij) when Xi / xj, and 1 otherwise, 
where Wij is uniformly sampled from [0, W\J\i — j\ ] and w is a parameter set to 0.2. We further in- 
ject some structure by introducing a closed chain of strong repulsive interactions uniformly sampled 
from [— lOu^O]. We consider models with n ranging from 10 to 60. These models have treewidth n 
and can be solved exactly (by brute force) only up to about n = 25 variables. 

Figure 4 ( a) | shows the results using various methods for varying problem size. We also computed 



ground truth for n < 25 by brute force enumeration. While other methods start to diverge from the 
ground truth at around n = 25, our estimate, as predicted by Theorem [TJ remains very accurate, 
visually overlapping in the plot. The actual estimation error is much smaller than the worst-case 
factor of 16 guaranteed by Theorem [TJ as in practice over- and under-estimation errors tend to 
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cancel out. For n > 25 we don't have ground truth, but other methods fall well outside the 
provable interval provided by WISH, reported as an error bar that is very small compared to the 
magnitude of errors made by the other methods. 

All optimization instances generated by WISH for n < 60 were solved (in parallel) to optimality 
within a timeout of 8 hours, resulting in high confidence tight approximations of the partition 
function. We are not aware of any other practical method that can provide such guarantees for 
counting problems of this size, i.e., a weighted sum defined over 2 60 items. 

6.2 Anytime Usage with Suboptimal Solutions 

Next, we investigate the quality of our results when not all of the optimization instances can be 
solved to optimality because of timeouts, so that the strong theoretical guarantees of Theorem [T] 
do not apply (although Theorem [3] still applies). We consider 10 x 10 binary Grid Ising models, 
for which ground truth can be computed using the junction tree method [32]. We use the same 
experimental setup as Hazan and Jaakkola [H] , who also use random MAP queries to derive bounds 
(without a tightness guarantee) on the partition function. Specifically, we have n = 100 binary 
variables X{ 6 { — 1, 1} with interaction tpij(xi, Xj) = exp(wijXiXj). For the attractive case, we draw 
Wij from [0, w]; for the mixed case, from [— w,w]. The "local field" is ipij(xi) = exp(fiXi) where /j, 
the strength at site i, is sampled uniformly from [— /, /], where / is a parameter with value 0.1 or 
1.0. 

Figure [3] reports the estimation error for the log-partition function, when using a timeout of 15 
minutes. We see that WISH provides accurate estimates for a wide range of weights, often improving 
over all other methods. The slight performance drop of WISH for coupling strengths w ~ 1 appears 
to occur because in that weight range the terms corresponding to i ~ n/2 parity constraints are 
the most significant in the output sum Mq + ^^Zq 1 Mj+i2\ Empirically, optimization instances 
with roughly n/2 parity constraints are often the hardest to solve, resulting in possibly a significant 
underestimation of the value of W = Z when a timeout occurs. We do not directly compare with 
the work of Hazan and Jaakkola [14J as we did not have access to their code. However, a visual 
look at their plots suggests that WISH would provide an improvement in accuracy, although with 
longer runtime. 

6.3 Hard Combinatorial Structures 

An interesting and combinatorially challenging graphical model arises from Sudoku, which is a 
popular number-placement puzzle where the goal is to fill a 9 x 9 grid (see Figure [4 (b)[ ) with digits 
from {1, • • • ,9} so that the entries in each row, column, and 3x3 block composing the grid, are all 
distinct. The puzzle can be encoded as a graphical model with 81 discrete variables with domain 
{1, • • • , 9}, with potentials ip a ({x} a ) = 1 if and only if all variables in {x} a are different, and a£l 
where I is an index set containing the subsets of variables in each row, column, and block. This 
defines a uniform probability distribution over all valid complete Sudoku grids (a non-valid grid 
has probability zero), and the normalization constant Z s equals the total number of valid grids. 
It is known that Z s = 6.671 x 10 21 . This number was computed exactly with a combination of 
computer enumeration and clever exploitation of properties of the symmetry group [8] . Here, we 
attempt to approximately compute this number using the general-purpose scheme WISH. 

First, following Felgenhauer and Jarvis [8], we simplify the problem by fixing the first block 
as in Figure |4(b)| obtaining a new problem over 72 variables whose normalization constant is 
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Figure 3: Estimation errors for the log-partition function on 10 x 10 randomly generated Ising 
Grids. 



Z' = Z s /9l rj 2 54 . Next, since we are dealing with a feasibility rather than optimization problem, 
we replace ToulBar2 with CryptoMiniSAT |26j . a SAT solver designed for unweighted cryptographic 
problems and which natively supports parity constraints. We observed that WISH can consistently 
find solutions (60% of the times) after adding 52 random parity constraints, while for 53 constraints 
the success rate drops below 0.5, at 45%. Therefore Mj = 1 in Algorithm [T] for i < 52 and there 
should thus be at least 2 52 • 9! ~ 1.634 x 10 21 solutions to the Sudoku puzzle. Although Theorem 
[T] cannot be applied due to timeouts for larger values of i, this estimate is clearly very close to 
the known true count. In contrast, the simple "local reasoning" done by variational methods is 
not powerful enough to find even a single solution. Mean Field and Belief Propagation report an 
estimated solution count of exp(— 237.921) and exp(— 119.307), resp., on a relaxed problem where 
violating a constraint gives a penalty exp(— 10). 



6.4 Model Selection 

Many inference and learning tasks require computing the normalization constant of graphical mod- 
els. For instance, it is needed to evaluate the likelihood of observed data for a given model. This 
is necessary for Model Selection, i.e., to rank candidate models, or to trigger early stopping during 
training when the likelihood of a validation set starts to decrease, in order to avoid overfitting |6j. 
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Figure 4: Clique Ising models, hard combinatorial structure of Sudoku, and Model Selection for 
hand-written digits. 
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We train Restricted Boltzmann Machines (RBM) [15J using Contrastive Divergence (CD) [5], [33] 
on MNIST hand-written digits dataset. In an RBM there is a layer of nh hidden binary variables 
h = hi,'" ,h nfi and a layer of n v binary visible units v = vi,-— ,v nv . The joint probability 
distribution is given by P(h,v) = \ ex.p(b'v + c'h + h'Wv). We use nh = 50 hidden units and 
n v = 196 visible units. We learn the parameters b,c,W using CD-/c for k £ {1,10,15}, where k 
denotes the number of Gibbs sampling steps used in the inference phase, with 15 training epochs 
and minibatches of size 20. 

Figure 4(c) depicts confabulations (samples generated with Gibbs sampling) from the three 
learned models. To evaluate the loglikelihood of the data and determine which model is the best, 
one needs to compute Z. We use WISH to estimate this quantity, with a timeout of 10 minutes, 
and then rank the models according to the average loglikelihood of the data. The scores we obtain 
are —41.70,-40.35,-40.01 for k = 1,10,15, respectively (larger scores means higher likelihood). 
In this case ToulBar2 was not able to prove optimality for all instances, so only Theorem [3] applies 
to these results. Although we do not have ground truth, it can be seen that the ranking of the 
models is consistent with what visually appears closer to a large collection of hand-written digits 
in Figure 4(c) Note that k = 1 is clearly not a good representative, because of the highly uneven 
distribution of digit occurrences. The ranking of WISH is also consistent with the fact that using 
more Gibbs sampling steps in the inference phase should provide better gradient estimates and 
therefore a better learned model. In contrast, Mean Field results in scores —35.47, —36.08, —36.84, 
resp., and would thus rank the models in reverse order of what is visually the most representative 
order. 



7 Conclusion 

We introduced WISH, a randomized algorithm that, with high probability, gives a constant-factor 
approximation of a general discrete integral defined over an exponentially large set. WISH reduces 
the intractable counting problem to a small number of instances of a combinatorial optimization 
problem subject to parity constraints used as a hash function. In the context of graphical models, 
we showed how to approximately compute the normalization constant, or partition function, using a 
small number of MAP queries. Using state-of-the-art combinatorial optimization tools, we are thus 
able to provide discrete integral or partition function estimates with approximation guarantees 
at a scale that could till now be handled only heuristically. Finally, our method is a massively 
parallelizable and anytime algorithm which can also be stopped early to obtain empirically accurate 
estimates that provide lower bounds with a high probability. 
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A Appendix: Proofs 

Proof of Proposition [7} Immediately follows from Lemma |3j □ 

Lemma 3 (pairwise independent hash functions construction). Let a £ {0, l} n , b £ {0, 1}. Then 
the family % = {h a b(x) : {0, 1}™ — > {0, 1}} where h a b(x) = a ■ x + b mod 2 is a family of pairwise 
independent hash functions. The function h a fi{x) can be alternatively rewritten in terms of XORs 
operations ffi, i.e. h a ^(x) = a\X\ ffi 02^2 ffi • • • ffi a n x n © b. 

Proof. Uniformity is clear because it is the sum of uniform Bernoulli random variables over the 
field F(2) (arithmetic modulo 2). For pairwise independence, given any two configurations x\,x 2 € 
{0, l} n , consider the sets of indexes S\ = {i : x\{i) = 1}, S2 = {i : £2(2) = 1}- Then 

H(xi)= ^2 a i® ^2 ai ® h 

i&s 1 ns 2 ieSi\S 2 

R(Si n S 2 ) © R(Si \S 2 )®b 

H(x 2 )= R(S 1 nS 2 )®R(S 2 \S 1 )®b 

Note that ii(Si n 52), R(Si \ S 2 ), R(S 2 \ Si) and b are independent as they depend on disjoint 
subsets of independent variables. When x\ 7^ x 2 , this implies that (H(xi), H(x 2 )) takes each value 
in {0, l} 2 with probability 1/4. □ 
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As pairwise independent random variables are fundamental tools for derandomization of algo- 
rithms, more complicated constructions based larger finite fields generated by a prime power ¥(q k ) 
where q is a prime number are known [27]. These constructions require a smaller number of ran- 
dom bits as input, and would therefore reduce the variance of our algorithm (which is deterministic 
except for the randomized hash function use). 

Proof of Lemma^ The cases where i + c>n or i — c<0 are obvious. For the other cases, let's 
define the set of the 2 J heaviest configurations as in Definition k| 



{0-1,02, • • • ,o- 2J } 



Define the following random variable 

Sj(h 



Ah) 



■b mod 2} 



ere*, 



which gives the number of elements of X» satisfying i random parity constraints. The randomness 
is over the choice of A and b, which are uniformly sampled in {0, l} iXn and {0, l} 1 respectively. By 
Proposition ll h Ab : X — > {0, 1}' is sampled from a family of pairwise independent hash functions. 



Therefore, from the uniformity property in DefinitionjTJ for any a the random variable l^ a=b mod 2 } 
is Bernoulli with probability 1/2*. By linearity of expectation, 



E[Sj(h\ b )] 



2-' 



Further, from the pairwise independence property in Definition [TJ 

Var[Sj(h\ b )] = ^ Var [l { A CT =&mod 



2}J 



Applying Chebychev Inequality, we get that for any k > 

Sj(h l A,b 



Pr 



2J 
2 1 



> k 




< 



k 2 



Recall the definition of the random variable Wi 
randomness is over the choice of A and b). Then 



maxo- w(a) subject to Aa = b mod 2 (the 



PrK > bj) = Pv[wi > w(a 2J )} > Pv[Sj(h\ b ) > 1] 

which is the probability that at least one configuration from Xj "survives" after adding i parity 
constraints. 

To ensure that the probability bound 1/k 2 provided by Chebychev Inequality is smaller than 
a 1/2, we need k > y/2. We use k = 3/2 for the rest of this proof, exploiting the following simple 
observations which hold for k = 3/2 and any c > 2: 



kV2 c < 2 C 
k\f2r c < 1 



- 1 

2- c 
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For j = i + c and k and c as above, we have that 



PrK > b i+ c) > Pr[S i+c (h\ b ) > 1] > 
Pv[\S i+c {h i )-2 c \ <2 C -1] > 



Pr 



Pr 

pWc(AV&) ■ 



|S. J+c (/i i )-2 c | <fc^ 



2 C < k* 2 C 1 



> 



> 



1 " p = 5 /9 > 1/2 



Similarly, for j = i — c and and c as above, we have Pr[u?j < 6j_ c ] > 5/9 > 1/2. 
Finally, using Chernoff inequality (since wj, ■ ■ ■ ,wf are i.i.d. realizations of wi) 

Pr [Mi < b^ c ] > 1 - exp(-a / (c)T) 
Pr [Mi > b l+c ] > 1 - exp(-c/(c)T) 

where a' (2) = 2(5/9 — 1/2) 2 , which gives the desired result 

Pr [b i+c < Mi < bi„ c ] > 1 - 2exp(a'(c)T) 
= 1 — exp(— a*(c)T) 

where a* (2) = ln2a'(2) = 2(5/9 - l/2) 2 ln2 > 0.0042 

Proof of Lemma^ Observe that we may rewrite V as follows: 

Jl— 1 n—c—2 



i=n—c—\ 
n-1 



i=0 



n-1 



b + bn2i+ h^~ c ~ x 

j=c+l 



i=n—c—l 
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Similarly, 



c—1 n— 1 



i=0 i=c 

n—c 



c—1 n—c 

b + Y + Y b^- 1 = 2 c b + 2 C Y bj^ 1 = 

i=0 j=l j=l 

(c n—c \ 

Y h i- 1 • £ b y- 1 

J • 3=c+l ) 

(c n—c 
^ 6 2 J " 1 + Y h P?~ l 
j=l j=c+l 
n—c 

2 2c b + 2 2c Y ''J-' ' ' ' 



j=c+l 



n-1 



\ i=n— c— 1 j=c+l I 

This finishes the proof. □ 
Proof of Theorem [£| As in the proof of Lemma [TJ define the random variable 

Su{h % A b ) = ^ l{Acr=fe mod 2} 

o-£{<7|mj(<t)>u} 

that gives the number of configurations with weight at least u satisfying i random parity constraints. 
Then for i < LlogG(u)J — c < log G(u) — c using Chebychev and Chernoff inequalities as in Lemma 

m 

Pr [Mi > u] > 1 - exp(-a'T) 
For i > [logG(u)] + c > logG(ii) + c, using Chebychev and Chernoff inequalities as in Lemma [I] 

Pr[Mj < u] > 1 - exp(-a'T) 

Therefore, 



Pr 



_L 2 ' ? ( M ) < G(u) < 2 c+l 2 q{u 



Pr 



Llog 2 G(«)J- C 

H Wi >^n( M riog 2 G(n)l+c 

i=0 



< u) 

1 — nexp(— a'T) > 1 — 5 



> 



> 



This finishes the proof. 



□ 
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Proof of Theorem [3| If w\ < w\ , from Theorem l] with probability at least 1 — 5 we have W < 
M + YJIIq M i+i 2 i < UB'. Since 20- < LB' < W < UB', it follows that with probability at least 



1-5, 



if 



< W. 



If w\ > w\ > j^wj, then from Theorem [l] with probability at least 1 — 5 the output is j^LB' < 
W < UB', and LB' < W < UB'. □ 
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